perf/row-major trace LDE for GPU path#715
Conversation
…via on-device transpose" This reverts commit 38f5600.
|
Review: row-major trace LDE for GPU path Solid, well-documented change with good parity-test coverage (Merkle root + barycentric, base & ext3). The row-major NTT/Keccak/transpose kernels look correct: grid-stride loops are bounded, the bit-reverse swap is race-free (the High
Low
The instruments relabeling and the two_half_fft/polynomial comment trims are fine. |
AI ReviewPR #715 · 15 changed files Findings
Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro). AI-001: gpu_and_cpu_proofs_both_verify test silently uses GPU for both proofs
Claim The new Evidence crypto/stark/src/gpu_lde.rs:43-51 declares Suggested fix Either change AI-002: Returned GPU LDE handle is unsafe to read across streams (missing synchronize after transpose)
Claim coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep return GpuLdeBase/Ext3 handles whose device buffer is populated by a final matrix_transpose_strided kernel without a stream synchronize. Downstream R3/R4 barycentric and DEEP consumers call backend().next_stream() and immediately read the handle buffer on that new stream. Because Backend disables cudarc event tracking and no explicit cudaEventRecord/cudaStreamWaitEvent is used, kernels on different pool streams can overlap. This creates a data race where the consumer may read the LDE buffer before the producer's transpose finishes. Evidence launch_row_to_col_major is documented with "No synchronize — callers on the same stream are ordered; other streams must synchronize themselves" and returns right after stream.launch_builder(...).launch(...). The LDE-row-major D2H and node D2H both call stream.synchronize(), but nothing synchronizes after the transpose. Downstream math_cuda::barycentric::barycentric_base_on_device and the r3_ctx variant call let stream = be.next_stream() and then launch kernels reading main_handle.buf/aux_handle.buf. Suggested fix Add stream.synchronize()? after launch_row_to_col_major in both coset_lde_row_major_with_merkle_tree_keep and coset_lde_ext3_row_major_with_merkle_tree_keep, before returning the handle. If the intended design is to keep the transpose async, attach a CUDA event to the producing stream and make consumers wait on it, but synchronizing here is the simpler and safer fix. AI-005: No direct GPU parity test for the new row-major LDE pipeline
Claim The new Evidence merkle_root_parity.rs Suggested fix Add unit parity tests that call AI-006: try_expand_leaf_and_tree_batched_keep and _ext3_keep are dead code
Claim
Evidence A grep across the workspace for Suggested fix Remove try_expand_leaf_and_tree_batched_keep, try_expand_leaf_and_tree_batched_ext3_keep, their math-cuda callees coset_lde_batch_base_into_with_merkle_tree_keep and coset_lde_batch_ext3_into_with_merkle_tree_keep if also unused, and the now-unused columns_to_row_major helper in prover.rs. AI-010: Implicit u32 truncation of `m` / `n` in row-major launch configs (no domain check)
Claim
Evidence lde.rs:280-302 casts Suggested fix Add AI-016: Potentially unsafe std::env mutation in tests without --test-threads=1
Claim The new test mutates the process-wide env var LAMBDA_VM_GPU_LDE_THRESHOLD via Evidence std::env::set_var is Suggested fix Run only this test ( Reviewer Lanes
Verification Lanes
Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report. Discarded candidates (1) — rejected by the verifier
Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts. |
The FxHasher/FxHashMap op-dedup micro-optimization is unrelated to the row-major GPU LDE rework and was only applied to 4 of 6 dedup tables. Revert the table maps to std HashMap and drop the hasher; it can land as its own focused PR.
The GPU full path is covered by the normal prove/verify suite built with --features cuda (plus gpu_path_fires_end_to_end), the CPU path by the non-cuda suite, and GPU/CPU equivalence by the merkle/barycentric parity tests. Its force-CPU leg also never ran on CPU: gpu_lde_threshold() only re-read the env var under cfg(test), but from the prover integration crate stark compiles without cfg(test), so the OnceLock cached the first value. Simplify gpu_lde_threshold() to a single cached impl now that the per-call re-read has no consumer.
keccak.cu: move keccak256_leaves_base_row_major out of keccak_merkle_level's doc block so the child-pair->parent doc rejoins its kernel. prover.rs: delete columns_to_row_major, which has no callers after the row-major GPU path stopped materializing GPU-expanded columns.
Extract coset_lde_row_major_inner shared by the base and ext3 _keep entry points (they differed only by m vs m*3 and the handle type), removing ~110 lines of drift-prone duplication. Add debug_assert!(num_rows >= 2) to launch_keccak_base_row_major: the kernel shifts by (64 - log_num_rows), UB at num_rows==1, matching the guard in launch_keccak_base.
The assert checked gpu_parts_lde_calls() > 0 with a comment claiming branch/shift tables are degree-3 — both false: fib_iterative_1M tables all have number_of_parts <= 2, and the common degree-2 case fires the fused two-halves path (gpu_extend_halves_calls), counted separately from the parts>2 path (gpu_parts_lde_calls) since #700. Assert on the sum so either composition-LDE path satisfies it. Validated on RTX 5090 / CUDA 13.1: make test-math-cuda 78/78, make test-cuda-integration green, proof verifies.
Review fixes for #715: drop scope creep, dead code, redundant test; dedup row-major LDE
Resolves the conflict between #715 ("perf/row-major trace LDE for GPU path") and this PR's row-pair commitment. The two reworked the same GPU trace-LDE/commit functions for orthogonal goals: #715 made the LDE consume a row-major input buffer (memory-layout win, kept a per-row commitment); this PR groups two bit-reversed rows per Merkle leaf (one path per query, proof_sym removed). Resolution keeps BOTH: #715's row-major LDE input, now emitting a row-pair tree. CPU side (verifier, proof format, commitment.rs) merged cleanly to the row-pair form — it is the fixed target the GPU must match. GPU side: - New CUDA kernel `keccak256_leaves_base_row_major_row_pair` (+ launcher `launch_keccak_base_row_major_row_pair`, device registration): the row-major analog of `keccak256_leaves_base_row_pair_batched`. Each leaf hashes two consecutive bit-reversed rows of `m` u64 lanes (base m cols; ext3 m*3, since components c0,c1,c2 are consecutive). Byte layout matches the CPU `commit_bit_reversed(.., 2)` and `verify_opening_pair`. - `coset_lde_row_major_inner` now builds `lde_size/2` row-pair leaves via the new launcher (was per-row). This is the only producer for the GPU base + ext3 trace commit (the two `_keep` wrappers). - gpu_lde.rs trace fast paths take #715's row-major dispatch unchanged (row-pair-ness is internal to the math-cuda function above). Deleted as now-unused (superseded): - The per-row row-major path: kernel `keccak256_leaves_base_row_major`, its launcher, and device field (replaced by the row-pair version). - `alloc_merkle_nodes` and the column-based `_keep` wrappers in math-cuda lde.rs (this PR's earlier GPU path; #715's row-major path replaces them). - Dead `columns_to_row_major` in prover.rs (round 1 reads the trace row-major directly via `main_data_row_major`). Fixed main-side tests for this PR's API changes: - prover_tests: `commit_rows_bit_reversed_matches_commit_bit_reversed` now compares the row-major commit against `commit_bit_reversed(.., 2)` (commit_columns_bit_reversed was removed). Validates row-major == column-major row-pair roots on CPU. - merkle_root_parity: pass `rows_per_leaf = 1` to keccak_leaves_base/ext3 (generic per-row primitive parity test). Verified locally: cargo build/check (default + `--features cuda`, incl. tests), `cargo test -p stark` (137 pass), `make lint` (all combos). The new CUDA kernel's byte-level correctness is NOT runtime-tested here (no GPU); the `cuda_path_integration` + math-cuda GPU tests must be run on a GPU host to confirm GPU↔CPU↔verifier root parity.
Builds on #650. Ports the row major LDE rework to the GPU path. Round 1 GPU was still paying two host-side O(N×M) layout copies that #650 eliminates on CPU:
Both are replaced by a single H2D and a single D2H, with the NTT running natively on row-major data.
New kernels
bit_reverse_row_major,ntt_dit_level_row_major,pointwise_mul_row_major: NTT pipeline on row-major buffers.threadIdx.x = column→ consecutive threads access consecutive columns of the same row → coalesced. Grid-stride loops throughout (gridDim.y capped at 65535).keccak256_leaves_base_row_major: each leaf reads a contiguous row slice of M elements. Launched at 128 threads/block — the Keccak 25-elementuint64_tstate exceeds the register budget at 1024, causing silent CPU fallback.matrix_transpose_strided: tiled 32×32 transpose applied once per LDE call to convert the output back to column-major for theGpuLdeBase/GpuLdeExt3handle (DEEP, barycentric expect column-major).For ext3 (aux trace):
Fp3 = [u64; 3]in memory, so row-major ext3 with M columns is identical to row-major base-field withm = M×3— the same kernels handle both without modification.Results
fib_iterative_8M, RTX 5090 (32 GB) · 2× AMD EPYC 7R13 · CUDA 13.2 ·TABLE_PARALLELISMdefault = 15 (available_parallelism()/3), median of 15 runs:All in Round 1 (main −1.73s, aux −1.24s). Rounds 2–4 unchanged.
Cleanup
Removed a local
bit_reverse_vecduplicate intwo_half_fft.rs(replaced byin_place_bit_reverse_permute) and an inline leaf-hashing loop inprover.rs(replaced by the existingkeccak_leaves_bit_reversed).Two-half FFT
The cache-blocked two-half FFT from #650 was prototyped for GPU and measured at zero improvement. On CPU it wins by keeping sub-FFTs cache-resident, avoiding the large-stride memory traffic of the flat Bowers FFT. On GPU that bottleneck doesn't exist: the row-major NTT already achieves coalesced access at every level (consecutive threads read consecutive columns of the same row), so there is no large-stride penalty to eliminate. Twiddle precomputation was already in the GPU backend.
GPU/CPU equivalence
Tests confirm GPU and CPU produce identical Merkle roots and OOD evaluations for the same inputs:
merkle_root_parity.rs(base + ext3),barycentric_cpu_gpu_parity.rs(base + ext3),cuda_path_integration::gpu_and_cpu_proofs_both_verify.cuda_path_integration2/2 on hardware.